Knowledge Transfer Session - Data Visualisation with Plotly


The tutorial is divided into two parts:

  1. In the first part we will learn how to create basic plots with plotly, how to create subplots and plots with multiple axes, and how to save our plots to a portable HTML file.

  2. In the second part, the attendees will split into 2-3 groups and work on new data, trying to create a visualisation for it. Each group will need to save its visualisation as an HTML file and send it to me so we can share what everyone did!


Tutorial Part 1.


In [23]:
import numpy as np
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import plotly
import pandas as pd
plotly.offline.init_notebook_mode()

1. Scatter plot

A scatter plot is the simplest plot we have: each observation is drawn as a single point on the x-y plane.

Reference: https://plotly.com/python/line-and-scatter/


1a) Generate data

We use the numpy package to generate random integers for both the x and y axes.

In [5]:
random_X = np.random.randint(1, 100, 1000)
random_y = np.random.randint(1, 100, 1000)
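
If you want everyone in the session to see the same random points, you can seed the generator. The sketch below uses numpy's newer Generator interface as an alternative to np.random.randint; the seed value 42 is an arbitrary assumption, not something the workshop data depends on.

```python
import numpy as np

# Assumption: 42 is an arbitrary seed; any fixed integer gives repeatable data
rng = np.random.default_rng(42)

# integers(low, high, size) excludes `high`, just like np.random.randint,
# so both arrays hold 1000 values between 1 and 99
random_X = rng.integers(1, 100, 1000)
random_y = rng.integers(1, 100, 1000)
```

Running the cell again with the same seed reproduces exactly the same arrays, which makes the plots easier to compare between attendees.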

1b) Create simple scatter plot

We are creating the first and the simplest scatter plot.

The only compulsory component we need to provide to create a plot is the data component. What we do is:

1) create the data component with go.Scatter(), where we specify the data for the x and y axes and the mode (how the points are drawn)

2) create the figure with go.Figure(), where we provide the components of the plot (such as the data component)

3) show the plot with fig.show()

In [6]:
data_component = go.Scatter(x=random_X,
                            y=random_y,
                            mode='markers')

fig = go.Figure(data=data_component)

fig.show()

1c) Improving the scatter plot - add title and axes labels

Another component that we can provide to a plotly plot is the layout component, created with go.Layout(), which defines what the figure looks like. You pass the layout component to go.Figure().

In [7]:
data_component = go.Scatter(x=random_X,
                            y=random_y,
                            mode='markers')

layout_component = go.Layout(title='My first scatter plot',
                             xaxis_title='random x',
                             yaxis_title='random y')

fig = go.Figure(data=data_component,
                layout=layout_component)

fig.show()

1d) changing the shape, colour and size of markers

You can change what the markers look like with the marker parameter, which takes a dictionary. Some of the keys it accepts are:

Colour: You can name the colour you want, for example 'red', or provide a colour code such as those listed here: https://plotly.com/python/discrete-color/

Size: Marker size as a number (in pixels), for example 12.

Symbol: Symbol is a part of styling markers. You can find the entire list here: https://plotly.com/python/marker-style/

Opacity: Transparency of the markers (from 0 - invisible, to 1 - fully visible)

Line: Border of the markers (as a dictionary)

In [8]:
data_component = go.Scatter(x=random_X,
                            y=random_y,
                            mode='markers',
                            marker=dict(size=12,
                                        color='green',
                                        symbol='hexagon',
                                        opacity=0.5,
                                        line=dict(width=2,
                                                  color='red')))

layout_component = go.Layout(title='My first scatter plot',
                             xaxis_title='random x',
                             yaxis_title='random y')

fig = go.Figure(data=data_component,
                layout=layout_component)

fig.show()

2. Line plot

A line plot is created in a very similar way to a scatter plot: the only difference is the mode you specify, mode='lines'.

Reference: https://plotly.com/python/line-charts/


2a) Load the data

The data we will visualise with line charts is the daily average temperature in London from May 2019 to May 2020. I have uploaded it to my GitHub so you can load it straight into the notebook with pandas, as below. The data is stored in a pandas DataFrame (df).

In [10]:
data_source = 'https://raw.githubusercontent.com/kamiloster/plotly_workshop/main/temperature_london.csv'

df = pd.read_csv(data_source)

df
Out[10]:
date tavg tmin tmax
0 23/05/2019 16.9 9.0 24.7
1 24/05/2019 16.9 10.8 21.1
2 25/05/2019 18.6 13.9 23.3
3 26/05/2019 17.2 14.0 20.1
4 27/05/2019 14.2 9.0 18.6
... ... ... ... ...
360 17/05/2020 14.3 7.7 20.2
361 18/05/2020 16.7 10.2 23.9
362 19/05/2020 18.4 11.6 26.0
363 20/05/2020 20.1 13.4 27.2
364 21/05/2020 20.1 13.3 27.3

365 rows × 4 columns

2b) Plot 3 different modes: markers, lines and their combination

We will compare the 3 available modes: markers (scatter plot), markers+lines (scatter with lines) and lines (line plot). I have added 15 and 30 °C to the temperature passed to the y axis for two of the traces, just so you can see the difference between the modes (otherwise they would lie on top of each other).

In [11]:
trace_markers = go.Scatter(x=df['date'],
                           y=df['tavg'],
                           mode='markers',
                           name='markers')

trace_lines = go.Scatter(x=df['date'],
                         y=df['tavg'] + 15,
                         mode='lines',
                         name='lines - added 15 °C')

trace_markers_lines = go.Scatter(x=df['date'],
                                 y=df['tavg'] + 30,
                                 mode='markers+lines',
                                 name='markers+lines - added 30 °C')

data_component = [trace_markers, trace_lines, trace_markers_lines]

layout_component = go.Layout(title='Comparison of different modes: markers, lines and markers+lines',
                             xaxis_title='Date',
                             yaxis_title='Daily average temperature (°C)',
                             hovermode='x')

fig = go.Figure(data=data_component,
                layout=layout_component)

fig.show()

3. Bar chart

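A bar chart can come in 3 different types: normal, stacked and nested (grouped). When you pass several bar traces, plotly draws them nested by default (barmode='group'); setting barmode='stack' in the layout stacks them instead.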

Reference: https://plotly.com/python/bar-charts/


3a) Load the data

The data I prepared is from the 2018 Winter Olympics. It summarises how many Gold, Silver, Bronze and Total medals each country won.

In [12]:
data_source = 'https://github.com/kamiloster/plotly_workshop/raw/main/2018WinterOlympics.csv'

df = pd.read_csv(data_source)

df
Out[12]:
Rank NOC Gold Silver Bronze Total
0 1 Norway 14 14 11 39
1 2 Germany 14 10 7 31
2 3 Canada 11 8 10 29
3 4 United States 9 8 6 23
4 5 Netherlands 8 6 6 20
5 6 Sweden 7 6 1 14
6 7 Republic of Korea 5 8 4 17
7 8 Switzerland 5 6 4 15
8 9 France 5 4 6 15
9 10 Austria 5 3 6 14
10 11 Japan 4 5 4 13
11 12 Italy 3 2 5 10
12 13 OAR 2 6 9 17
13 14 Czech Republic 2 2 3 7
14 15 Belarus 2 1 0 3
15 16 China 1 6 2 9
16 17 Slovakia 1 2 0 3
17 18 Finland 1 1 4 6
18 19 Great Britain 1 0 4 5
19 20 Poland 1 0 1 2
20 21 Hungary 1 0 0 1
21 21 Ukraine 1 0 0 1
22 23 Australia 0 2 1 3
23 24 Slovenia 0 1 1 2
24 25 Belgium 0 1 0 1
25 26 Spain 0 0 2 2
26 26 New Zealand 0 0 2 2
27 28 Kazakhstan 0 0 1 1
28 28 Latvia 0 0 1 1
29 28 Liechtenstein 0 0 1 1

3b) Normal bar chart

In [13]:
data_component = go.Bar(x=df['NOC'],
                        y=df['Total'])

layout_component = go.Layout(title='Medals in 2018 Olympics',
                             xaxis_title='Country',
                             yaxis_title='Total number of medals')

fig = go.Figure(data=data_component,
                layout=layout_component)

fig.show()

3c) Nested (grouped) bar chart

In [14]:
trace_1 = go.Bar(x=df['NOC'],
                 y=df['Gold'],
                 name='Gold')

trace_2 = go.Bar(x=df['NOC'],
                 y=df['Silver'],
                 name='Silver')

trace_3 = go.Bar(x=df['NOC'],
                 y=df['Bronze'],
                 name='Bronze')

data_component = [trace_1, trace_2, trace_3]

layout_component = go.Layout(title='Medals in 2018 Olympics',
                             xaxis_title='Country',
                             yaxis_title='Number of medals')

fig = go.Figure(data=data_component,
                layout=layout_component)

fig.show()

3d) Stacked bar chart

In [15]:
trace_1 = go.Bar(x=df['NOC'],
                 y=df['Gold'],
                 name='Gold',
                 marker=dict(color='gold'))

trace_2 = go.Bar(x=df['NOC'],
                 y=df['Silver'],
                 name='Silver',
                 marker=dict(color='silver'))

trace_3 = go.Bar(x=df['NOC'],
                 y=df['Bronze'],
                 name='Bronze',
                 marker=dict(color='brown'))

data_component = [trace_1, trace_2, trace_3]

layout_component = go.Layout(title='Medals in 2018 Olympics',
                             xaxis_title='Country',
                             yaxis_title='Number of medals',
                             barmode='stack')


fig = go.Figure(data=data_component, 
                layout=layout_component)

fig.show()

5. Box plots

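Box plots are very important in statistical analysis. They show how the data are distributed: the box spans the lower and upper quartiles, a line inside it marks the median, and the whiskers show the range of the remaining data (points beyond them are drawn as outliers).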

Reference: https://plotly.com/python/box-plots/


5a) Load the data

The Abalone dataset is very popular in machine learning. Abalone is a type of shellfish whose age is often determined by counting the rings on the shell. Machine learning has been used to relate this age to other properties: length, diameter, height and others. It is important to visualise the statistics of these properties.

In [16]:
data_source = 'https://github.com/kamiloster/plotly_workshop/raw/main/abalone.csv'

df = pd.read_csv(data_source)

df
Out[16]:
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight rings
0 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.1500 15
1 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.0700 7
2 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.2100 9
3 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.1550 10
4 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.0550 7
... ... ... ... ... ... ... ... ... ...
4172 F 0.565 0.450 0.165 0.8870 0.3700 0.2390 0.2490 11
4173 M 0.590 0.440 0.135 0.9660 0.4390 0.2145 0.2605 10
4174 M 0.600 0.475 0.205 1.1760 0.5255 0.2875 0.3080 9
4175 F 0.625 0.485 0.150 1.0945 0.5310 0.2610 0.2960 10
4176 M 0.710 0.555 0.195 1.9485 0.9455 0.3765 0.4950 12

4177 rows × 9 columns

5b) Other ways to create plotly plots

You can also add traces to an already created figure with fig.add_trace(), and then update the layout with fig.update_layout():

1) Create figure fig = go.Figure()

2) Add all the plots you want with fig.add_trace()

3) Update the layout with fig.update_layout()

4) Show the figure fig.show()

In [17]:
fig = go.Figure()

fig.add_trace(go.Box(y=df['length'],
                     name='Length'))

fig.add_trace(go.Box(y=df['diameter'],
                     name='Diameter'))

fig.add_trace(go.Box(y=df['height'],
                     name='Height'))

fig.add_trace(go.Box(y=df['whole_weight'],
                     name='Whole weight'))

fig.update_layout(title='Box plots for basic properties of abalone shellfish',
                  xaxis_title='Property',
                  yaxis_title='Property value')

fig.show()

6. Histogram

Reference: https://plotly.com/python/histograms/


In [18]:
fig = go.Figure()

fig.add_trace(go.Histogram(x=df['length'],
                           name='Length'))

fig.add_trace(go.Histogram(x=df['diameter'],
                           name='Diameter'))

fig.add_trace(go.Histogram(x=df['height'],
                           name='Height'))

fig.add_trace(go.Histogram(x=df['whole_weight'],
                           name='Whole weight'))

fig.update_layout(title='Histogram for basic properties of abalone shellfish',
                  xaxis_title='Bin',
                  yaxis_title='Property count')

fig.show()
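
The second histogram uses the same traces but sets barmode='stack', so the bars are stacked on top of each other instead of being drawn side by side.

In [19]: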
fig = go.Figure()

fig.add_trace(go.Histogram(x=df['length'],
                           name='Length'))

fig.add_trace(go.Histogram(x=df['diameter'],
                           name='Diameter'))

fig.add_trace(go.Histogram(x=df['height'],
                           name='Height'))

fig.add_trace(go.Histogram(x=df['whole_weight'],
                           name='Whole weight'))

fig.update_layout(title='Histogram for basic properties of abalone shellfish',
                  xaxis_title='Bin',
                  yaxis_title='Property count',
                  barmode='stack')

fig.show()

7. Heat maps

Heat maps are very useful for visualising how a value z varies across two dimensions x and y, for example a matrix of Pearson correlation coefficients.

Reference: https://plotly.com/python/heatmaps/


7a) Load the data

The next dataset we are looking at is the hourly average temperature in Santa Barbara, California.

In [20]:
data_source = 'https://github.com/kamiloster/plotly_workshop/raw/main/2010SantaBarbaraCA.csv'

df = pd.read_csv(data_source)

df
Out[20]:
LST_DATE DAY LST_TIME T_HR_AVG
0 20100601 TUESDAY 0:00 12.7
1 20100601 TUESDAY 1:00 12.7
2 20100601 TUESDAY 2:00 12.3
3 20100601 TUESDAY 3:00 12.5
4 20100601 TUESDAY 4:00 12.7
... ... ... ... ...
163 20100607 MONDAY 19:00 15.6
164 20100607 MONDAY 20:00 14.8
165 20100607 MONDAY 21:00 14.3
166 20100607 MONDAY 22:00 14.4
167 20100607 MONDAY 23:00 14.6

168 rows × 4 columns

In [21]:
fig = go.Figure()

fig.add_trace(go.Heatmap(x=df['DAY'],
                         y=df['LST_TIME'],
                         z=df['T_HR_AVG']))

fig.update_layout(title='Hourly average temperature across the week in Santa Barbara (California)',
                  xaxis_title='Day of the week',
                  yaxis_title='Hour of the day')

fig.show()

8. Multiple axes

Sometimes, when we plot variables whose values differ significantly in magnitude, it makes more sense to create two y axes (or two x axes).

Reference: https://plotly.com/python/multiple-axes/


In [22]:
data_source = 'https://raw.githubusercontent.com/kamiloster/plotly_workshop/main/temperature_flow_rate.csv'

df = pd.read_csv(data_source)

df
Out[22]:
Date Temperature Flow rate
0 24/07/2016 565.930603 17.128586
1 25/07/2016 568.573181 17.151127
2 25/07/2016 567.713318 17.263832
3 26/07/2016 567.783081 17.209528
4 26/07/2016 568.730286 17.241804
... ... ... ...
994 18/03/2017 711.218689 9.588627
995 18/03/2017 711.049438 15.312500
996 18/03/2017 689.779785 18.983606
997 19/03/2017 686.987915 9.658299
998 19/03/2017 692.987366 9.752049

999 rows × 3 columns

In [24]:
fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(go.Scatter(x=df['Date'],
                         y=df['Temperature'],
                         name='Temperature'),
              secondary_y=False)

fig.add_trace(go.Scatter(x=df['Date'],
                         y=df['Flow rate'],
                         name='Flow rate'),
              secondary_y=True)

fig.update_layout(title_text="Plot with two axes: temperature and flow rate")
fig.update_xaxes(title_text="Date")
fig.update_yaxes(title_text="Temperature", secondary_y=False)
fig.update_yaxes(title_text="Flow rate", secondary_y=True)

fig.show()

9. Creating subplots

Subplots are very useful when we want to compare different types of plots in one figure.

Reference: https://plotly.com/python/subplots/


9a) Load the data and get correlation map with df.corr()

In [25]:
data_source = 'https://raw.githubusercontent.com/kamiloster/plotly_workshop/main/temperature_london.csv'

df = pd.read_csv(data_source)

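# Correlate only the numeric columns; the 'date' column holds strings
# and makes df.corr() fail in recent pandas versions
correlation_map = df[['tavg', 'tmin', 'tmax']].corr()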

df
Out[25]:
date tavg tmin tmax
0 23/05/2019 16.9 9.0 24.7
1 24/05/2019 16.9 10.8 21.1
2 25/05/2019 18.6 13.9 23.3
3 26/05/2019 17.2 14.0 20.1
4 27/05/2019 14.2 9.0 18.6
... ... ... ... ...
360 17/05/2020 14.3 7.7 20.2
361 18/05/2020 16.7 10.2 23.9
362 19/05/2020 18.4 11.6 26.0
363 20/05/2020 20.1 13.4 27.2
364 21/05/2020 20.1 13.3 27.3

365 rows × 4 columns

9b) Create subplot

In [26]:
fig = make_subplots(rows=2,
                    cols=2,
                    vertical_spacing=0.2,
                    subplot_titles=('Line plot',
                                    'Heatmap',
                                    'Histogram',
                                    'Box plot'))

# Line plots - 1 x 1
fig.add_trace(go.Scatter(x=df['date'],
                         y=df['tmin'],
                         name='Minimum temperature',
                         mode='lines',
                         marker=dict(color='#B6E880')),
              row=1,
              col=1)

fig.add_trace(go.Scatter(x=df['date'],
                         y=df['tmax'],
                         name='Maximum temperature',
                         mode='lines',
                         marker=dict(color='#17BECF')),
              row=1,
              col=1)

fig.add_trace(go.Scatter(x=df['date'],
                         y=df['tavg'],
                         name='Average temperature',
                         mode='lines',
                         marker=dict(color='black')),
              row=1,
              col=1)

# Heatmap - 1 x 2
fig.add_trace(go.Heatmap(z=correlation_map,
                         x=df.columns[1:],
                         y=df.columns[1:],
                         showscale=False),
              row=1,
              col=2)


# Histograms - 2 x 1
fig.add_trace(go.Histogram(x=df['tavg'],
                           name='Average temperature',
                           marker=dict(color='black'),
                           showlegend=False),
              row=2,
              col=1)

fig.add_trace(go.Histogram(x=df['tmin'],
                           name='Minimum temperature',
                           marker=dict(color='#B6E880'),
                           showlegend=False),
              row=2,
              col=1)

fig.add_trace(go.Histogram(x=df['tmax'],
                           name='Maximum temperature',
                           marker=dict(color='#17BECF'),
                           showlegend=False),
              row=2,
              col=1)

# Box plots - 2 x 2
fig.add_trace(go.Box(y=df['tavg'],
                     name='Average temperature',
                     marker=dict(color='black'),
                     showlegend=False,
                     boxpoints='all'),
              row=2,
              col=2)

fig.add_trace(go.Box(y=df['tmin'],
                     name='Minimum temperature',
                     marker=dict(color='#B6E880'),
                     showlegend=False,
                     boxpoints='all'),
              row=2,
              col=2)

fig.add_trace(go.Box(y=df['tmax'],
                     name='Maximum temperature',
                     marker=dict(color='#17BECF'),
                     showlegend=False,
                     boxpoints='all'),
              row=2,
              col=2)

fig.update_layout(legend=dict(y=1.3, x=0),
                  barmode='stack')

fig.update_yaxes(title_text='Temperature (°C)',
                 row=1,
                 col=1)

fig.update_xaxes(title_text='Date',
                 showticklabels=False,
                 row=1,
                 col=1)

fig.update_yaxes(title_text='Count',
                 row=2,
                 col=1)

fig.update_xaxes(title_text='Bin',
                 row=2,
                 col=1)

fig.update_yaxes(title_text='Temperature (°C)',
                 row=2,
                 col=2)

# Save next to the notebook; change the path if you want the file elsewhere
fig.write_html('figure.html')

fig.show()